ML Questions

1. ML Fundamentals

1.1 Datasets

1.1.1. Data Preparation Process

  1. Data Collection with random or stratified sampling
  2. Data Cleaning, including checks for missing or duplicate data, outlier identification, removal of irrelevant information, and error correction
  3. Data Labeling
  4. Data Splitting into training, validation, and test sets; the training set for weight learning, the validation set for hyperparameter tuning, and the test set for performance evaluation
  5. Data Preprocessing with normalization, scaling or transforming
  6. Balance Checking
  7. Data Shuffling to reduce bias and eliminate order-based information
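
The splitting and shuffling steps above (4 and 7) can be sketched with the standard library alone; `stratified_split` is a hypothetical helper that preserves each label's proportion across the three splits:

```python
import random
from collections import defaultdict

def stratified_split(samples, labels, train_frac=0.7, val_frac=0.15, seed=42):
    """Split (sample, label) pairs into train/val/test while keeping the
    label distribution roughly equal in each split (illustrative helper)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for x, y in zip(samples, labels):
        by_label[y].append((x, y))
    train, val, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)                  # step 7: shuffle within each stratum
        n_train = round(len(group) * train_frac)
        n_val = round(len(group) * val_frac)
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]
    rng.shuffle(train)                      # final shuffle to remove order bias
    return train, val, test

xs = list(range(100))
ys = [0] * 80 + [1] * 20                    # imbalanced labels, 80/20
train, val, test = stratified_split(xs, ys)
print(len(train), len(val), len(test))
```

Because the split is done per label group, the 80/20 class ratio of the full dataset carries over into each of the three splits, which matters for the balance check in step 6.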

1.1.2. How to Verify that Collected Data Is Suitable

1.1.3. Handle Label Imbalance

1.1.4. Handle Missing Labels

1.2 Features

1.2.1 Types of Input Features

1.2.2 What is Feature Selection / Importance

1.2.3 How to Perform Feature Selection

1.2.4 Handle Missing Values

1.3 Modeling

1.3.1 Common modeling algorithms

1.3.2 Common loss functions

1.3.3 Second-Order Optimization Methods

1.3.4 Common Optimization Algorithms

1.3.5 Hyperparameter Tuning

1.3.6 Overfitting

1.3.7 Regularization (Grab Interview Question on 2019.08)

1.3.8 Linear and Logistic Regression

1.3.9 Activation Functions

1.3.10 Difference between boosting and bagging (Grab Glassdoor interview 2019.07)

1.3.11 Vanishing and Exploding Gradient

1.3.12 Unsupervised Learning

1.4 Evaluation

1.4.1 Determine whether a loss function is convex

1.4.2 Evaluation Metrics

1.4.3 Why some metrics are not used to optimize a model

1.4.4 Debug a poorly performing model

2. ML System Design

```mermaid
---
config:
  theme: redux
---
flowchart TB
    Pre --> Design
    subgraph Pre
        cp["`**Clarify Problem** <i>(1) Why (2) what metrics (3) what contents (4) how to blend contents (5) operational parameters of RecSys</i>`"] --> hd["`**High Level Design** <i>(1) High-level design (2) cold-start</i>`"]
    end
    subgraph Design
        DC["`**Data Collection & Preprocessing**`"]
        FE["`**Feature Engineering**`"]
        MEM["`**Modeling & Evaluation Metrics**`"]
        DE["`**Deployment**`"]
        DC --> FE --> MEM --> DE
    end
```

2.1 Design Framework

  1. Clarify the problem: Why do it? Who are the end users? What metrics? Dataset size; latency requirements
  2. High-level design: identify inputs, intermediate stages, and outputs
  3. Data Collection & Processing
  4. Feature engineering
  5. Modeling & evaluation
  6. Deployment & serving

2.2 Clarify the Problem (RecSys)

2.2.1 Why recommend content to users

2.2.2 What are the metrics

2.2.3 What kinds of contents

2.2.4 How to blend contents (e.g., in-network, out-of-network, ads)?

2.2.5 What are the operational parameters of RecSys

2.3 High-Level Design

2.3.1 the high-level design for RecSys

  1. Candidate generation: to produce hundreds or thousands of contents
  2. Filters: to eliminate 20% to 90% of the remaining candidates
  3. Pre-ranking: narrow down to a few hundred candidates
  4. Full-ranking: assign scores to contents
  5. Reranking: rerank candidates based on a variety of factors, such as diversity, freshness
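
A toy end-to-end sketch of this five-stage funnel; every scoring rule, threshold, and field name below is an illustrative assumption, not from the text:

```python
def recommend(user_id, corpus, k=10):
    """Toy recommendation funnel following the five stages above.
    A real system would use user features; user_id is a placeholder here."""
    # 1. Candidate generation: cheap retrieval of hundreds/thousands of items
    candidates = [it for it in corpus if it["topic"] in {"ml", "sys"}]
    # 2. Filters: drop ineligible items (e.g., already seen, policy-blocked)
    candidates = [it for it in candidates if not it["seen"]]
    # 3. Pre-ranking: a lightweight score narrows the pool to a few hundred
    candidates.sort(key=lambda it: it["popularity"], reverse=True)
    candidates = candidates[:200]
    # 4. Full-ranking: a (stand-in) heavier model assigns a final score
    for it in candidates:
        it["score"] = 0.7 * it["popularity"] + 0.3 * it["freshness"]
    candidates.sort(key=lambda it: it["score"], reverse=True)
    # 5. Reranking: enforce diversity, here at most 2 items per topic
    counts, result = {}, []
    for it in candidates:
        if counts.get(it["topic"], 0) < 2:
            result.append(it)
            counts[it["topic"]] = counts.get(it["topic"], 0) + 1
        if len(result) == k:
            break
    return result

corpus = [
    {"id": 1, "topic": "ml",  "popularity": 0.9,  "freshness": 0.1, "seen": False},
    {"id": 2, "topic": "ml",  "popularity": 0.8,  "freshness": 0.9, "seen": False},
    {"id": 3, "topic": "ml",  "popularity": 0.7,  "freshness": 0.5, "seen": False},
    {"id": 4, "topic": "sys", "popularity": 0.6,  "freshness": 0.9, "seen": False},
    {"id": 5, "topic": "sys", "popularity": 0.95, "freshness": 0.2, "seen": True},
    {"id": 6, "topic": "art", "popularity": 0.99, "freshness": 0.9, "seen": False},
]
top = recommend("u1", corpus, k=3)
print([it["id"] for it in top])
```

Note how item 6 never reaches ranking (filtered at candidate generation) and item 5 is dropped by the filter stage, so the expensive full-ranking score is only ever computed for the survivors; that cost asymmetry is the reason for the funnel shape.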

2.3.2 cold start problem

2.4 Data Collection

2.4.1 What datasets and how to collect

2.4.2 Possible biases and solution in the dataset

2.5 Candidate Generation

2.5.1 Sources of Candidate Generation

2.5.2 Steps in Candidate Generation

  1. Fetch sources: the seeds for generating candidates, such as items the user has interacted with historically
  2. Generate candidates: with candidate generation algorithms
  3. Filter Candidates: by merging and pruning
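
Step 3 (merging and pruning) can be sketched as below; the source names, ids, and scores are made up for illustration:

```python
def merge_candidates(source_lists, blocked_ids, max_candidates=1000):
    """Merge (item_id, score) candidates from several retrieval sources:
    drop blocked ids, deduplicate by id keeping the best score, and cap
    the total count so downstream ranking stays affordable."""
    best = {}
    for source in source_lists:
        for item_id, score in source:
            if item_id in blocked_ids:
                continue                              # prune filtered items
            if item_id not in best or score > best[item_id]:
                best[item_id] = score                 # dedupe: keep max score
    merged = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return merged[:max_candidates]

# Hypothetical sources: a two-tower retriever and item-based CF
two_tower = [(101, 0.9), (102, 0.5)]
item_cf   = [(102, 0.8), (103, 0.7), (104, 0.4)]
merged = merge_candidates([two_tower, item_cf], blocked_ids={104}, max_candidates=3)
print(merged)  # → [(101, 0.9), (102, 0.8), (103, 0.7)]
```

Item 102 appears in both sources but survives once with its higher score, and the blocked item 104 is pruned before merging.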

2.5.3 Algorithms for Candidate Generation

2.5.4 How to merge and prune candidates to limit #Candidates

2.5.5 Why not rank items in candidate generation

2.5.6 How to handle the case where, in a large-scale system, not every candidate can be scored in real time

2.6 Pre-Ranking

2.6.1 What is pre-ranking (light-ranking)

2.6.2 Evaluation Metrics for Pre-Ranking Model

2.6.3 Algorithms for Pre-Ranking Model

2.6.4 How to handle pre-ranking at large scale

2.7 Feature Engineering

2.7.1 What features to use

2.7.2 How to handle textual or id-based features

2.7.3 How to handle counting features

2.8 Modeling & Evaluation

2.8.1 What does the heavy ranking model learn?

2.8.2 Algorithms for full-ranking model?

2.8.3 Evaluation Metrics